feat(pandas): Flopy pandas support by scottrp · Pull Request #1955 · modflowpy/flopy

scottrp · 2023-09-19T17:10:15Z

This PR is the first step towards integrating Pandas into Flopy. This integration takes place in the MFPandasList and MFPandasTransientList classes (MFPandas*), which are used instead of the MFList and MFTransientList classes under the following conditions:

The data is in a package that is not considered advanced
The data type meets certain criteria, like it is not a jagged list with variable column length
The flopy simulation data option use_pandas is set to true (default value is true)

The MFPandas* classes currently support the same interface as MFList and MFTransientList, and should behave similarly for the end user. However, MFPandas* stores data internally in a pandas DataFrame and reads and writes data using the pandas “read_csv” and “to_csv” methods, which can be significantly faster than flopy’s current file reading. The MFPandas* classes’ set_data methods accept DataFrames, and their new “get_dataframe” method returns data in a pandas DataFrame (“get_data” still returns a recarray).
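
As a rough illustration of the storage and I/O pattern described above (not flopy's actual implementation), the following sketch holds list data in a DataFrame, writes it with `to_csv`, and reads it back with `read_csv`. The column names, sample values, and space-delimited layout are assumptions chosen for the example.

```python
import io

import pandas as pd

# Hypothetical list data: one row per boundary cell (names are illustrative).
columns = ["layer", "row", "column", "head"]
df = pd.DataFrame(
    [(1, 1, 1, 10.0), (1, 5, 7, 9.5), (2, 3, 4, 8.25)],
    columns=columns,
)

# Write space-delimited text without a header row or index column.
buffer = io.StringIO()
df.to_csv(buffer, sep=" ", header=False, index=False)

# Read the same text back, supplying the column names explicitly.
buffer.seek(0)
df_read = pd.read_csv(buffer, sep=" ", header=None, names=columns)

# A recarray copy of the data, comparable in kind to what get_data returns.
recarray = df_read.to_records(index=False)
```

Keeping the DataFrame as the internal representation means a recarray can still be produced on demand for callers that rely on the older interface.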

Remaining work on this PR includes:

When reading files, do not use Python’s “tell” method to record the start and finish of data (this can be problematic for files opened as text).
Convert recarrays to Pandas using the “from_records” method instead of the dataframe constructor.
Remove cellid tuples support from Flopy. Flopy will only accept cellids stored in separate layer, row, and column fields (or appropriate fields for the discretization) instead of also supporting cellids as a single field with a tuple (layer, row, column). All flopy lists (including the old MFList* classes and the new MFPandasList* classes) will store each component of the cellid in a separate column. This feature may or may not be part of this PR depending on timing.
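
The second and third items above can be sketched with NumPy and pandas alone. This is an illustration under assumed field names (`cellid`, `head`, `layer`, `row`, `column`), not code from the PR: it converts a recarray with `DataFrame.from_records` and then expands a tuple-valued cellid field into one column per component.

```python
import numpy as np
import pandas as pd

# Hypothetical recarray with the cellid stored as a single tuple field.
rec = np.array(
    [((0, 1, 2), 10.0), ((0, 3, 4), 9.5)],
    dtype=[("cellid", object), ("head", float)],
).view(np.recarray)

# Convert using from_records rather than the DataFrame constructor.
df = pd.DataFrame.from_records(rec)

# Expand each (layer, row, column) tuple into three separate columns.
cellid_cols = pd.DataFrame(
    df["cellid"].tolist(),
    columns=["layer", "row", "column"],
    index=df.index,
)
df_split = pd.concat([cellid_cols, df.drop(columns="cellid")], axis=1)
```

For other discretizations the component names would differ, so the hard-coded `layer`, `row`, `column` list is only appropriate for a structured grid.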

codecov · 2023-09-19T17:15:42Z

Codecov Report

Merging #1955 (061dcbe) into develop (6e23400) will decrease coverage by 1.0%.
The diff coverage is 22.6%.

@@            Coverage Diff            @@
##           develop   #1955     +/-   ##
=========================================
- Coverage     72.6%   71.7%   -1.0%     
=========================================
  Files          257     258      +1     
  Lines        57800   57412    -388     
=========================================
- Hits         42017   41179    -838     
- Misses       15783   16233    +450

Files	Coverage Δ
flopy/mf6/data/mfdata.py	`75.8% <100.0%> (+0.1%)`	⬆️
flopy/mf6/data/mfdataarray.py	`60.9% <ø> (-0.1%)`	⬇️
flopy/mf6/data/mfdatalist.py	`71.0% <100.0%> (-0.4%)`	⬇️
flopy/mf6/data/mfdatascalar.py	`60.5% <ø> (-0.2%)`	⬇️
flopy/mf6/mfsimbase.py	`67.5% <100.0%> (+<0.1%)`	⬆️
flopy/mf6/modflow/mfgwfchd.py	`100.0% <ø> (ø)`
flopy/mf6/modflow/mfgwfdrn.py	`100.0% <ø> (ø)`
flopy/mf6/modflow/mfgwfevt.py	`100.0% <ø> (ø)`
flopy/mf6/modflow/mfgwfevta.py	`100.0% <ø> (ø)`
flopy/mf6/modflow/mfgwfghb.py	`100.0% <ø> (ø)`
... and 17 more

... and 58 files with indirect coverage changes

wpbonelli · 2023-09-25T16:57:23Z

linking back to comment on original PR, sorry again for the accidental close

jlarsen-usgs · 2023-09-25T19:12:48Z

+        """
+        self._set_data(data, check_data=check_data)
+
+    def set_record(self, data_record, autofill=False, check_data=True):


should this be "set_data_record" to be consistent with the variable that it is setting? Or set_control_record?

Is the hidden and exposed method necessary for this? set_record() is only calling _set_record()

Variable name changed to be more consistent with the method name. I changed the variable name instead of the method name since a method name change would break the existing interface.

Having the hidden method is not necessary. I was originally doing this to make sure the correct method (parent vs child class) got called. But it is better to just explicitly define this, which I am now doing.

jlarsen-usgs · 2023-09-25T19:13:18Z

+        """
+        self._set_record(data_record, autofill, check_data)
+
+    def _set_record(self, data_record, autofill=False, check_data=True):


should this be _set_data_record to be consistent with the variable it is setting?

Renamed variables to be consistent with method name

jlarsen-usgs · 2023-09-25T19:18:37Z

+        self._resync()
+        try:
+            # convert to tuple
+            tuple_record = ()


I'm not sure I understand why lists are being converted to tuples here. It looks like there is support for lists in .append_data.

And if this is a single list record, could this just be tuple(record)?

Removed the conversion code and now passing list instead of tuple.

jlarsen-usgs · 2023-09-25T19:20:15Z

+                ex,
+            )
+
+    def update_record(self, record, key_index):


this would be clearer if key_index was kper or stress_period, etc...

This interface is also used for packages like Time-Array Series, whose "TIME" block (BEGIN TIME <tas_time>) has a key that is not a stress period. Similarly, the Observation package "CONTINUOUS" block (BEGIN CONTINUOUS FILEOUT <obs_output_file_name>) has a key that is a file name. I therefore chose the generalized name "key_index".
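
To make the reasoning concrete, here is a minimal sketch of block storage keyed by a generic value. The `store_record` helper, the `blocks` dictionary, and the sample records are all hypothetical and only illustrate why a name tied to stress periods would be too narrow.

```python
from typing import Dict, Hashable, List, Tuple


def store_record(
    blocks: Dict[Hashable, List[Tuple]], record: Tuple, key_index: Hashable
) -> None:
    """Append a record to the block identified by key_index (hypothetical helper)."""
    blocks.setdefault(key_index, []).append(record)


blocks: Dict[Hashable, List[Tuple]] = {}

# Key is a stress period index.
store_record(blocks, (0, 1, 2, 10.0), 0)
# Key is a time value, as in a block that begins with a time.
store_record(blocks, (1.5, 2.5), 365.0)
# Key is an output file name, as in a block that begins with FILEOUT.
store_record(blocks, ("obs_1", "HEAD", (0, 1, 2)), "heads.csv")
```

All three calls go through the same interface, and only the type and meaning of the key differ.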

jlarsen-usgs · 2023-09-26T18:56:01Z

+                message = (
+                    f"ERROR: Data list {self._data_name} supplied the "
+                    f"wrong number of columns of data, expected "
+                    f"{len(self._data_item_names)} got {len(data[0])}."
+                )
+                type_, value_, traceback_ = sys.exc_info()
+                raise MFDataException(
+                    self._data_dimensions.structure.get_model(),
+                    self._data_dimensions.structure.get_package(),
+                    self._data_dimensions.structure.path,
+                    "setting list data",
+                    self._data_dimensions.structure.name,
+                    inspect.stack()[0][3],
+                    type_,
+                    value_,
+                    traceback_,
+                    message,
+                    self._simulation_data.debug,
+                )


It seems like we could return the name of the missing column (or extra column) here to provide a more detailed error message.

I can not easily determine the name of the missing column since this code accepts Pandas Dataframes with incorrect column names (it corrects the column names below, which is trivial to do given that the Dataframe has the correct number of columns). I did however make a more detailed error message that lists the data column names supplied and expected.
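
A message of the kind described might be assembled as follows. This is a sketch only: the helper name `column_mismatch_message` and the exact wording are assumptions, not the text that was committed.

```python
def column_mismatch_message(data_name, expected_names, supplied_names):
    """Build an error message listing expected and supplied columns (hypothetical helper)."""
    expected = ", ".join(str(name) for name in expected_names)
    supplied = ", ".join(str(name) for name in supplied_names)
    return (
        f"ERROR: Data list {data_name} supplied the wrong number of "
        f"columns of data, expected {len(expected_names)} got "
        f"{len(supplied_names)}. Expected columns: {expected}. "
        f"Supplied columns: {supplied}."
    )


# Example with illustrative names: the supplied data lacks one column.
message = column_mismatch_message(
    "stress_period_data",
    ["layer", "row", "column", "head"],
    ["layer", "row", "head"],
)
```

Listing both sets of names lets the user spot the discrepancy even when the supplied names are not trusted enough to compute the difference automatically.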

jlarsen-usgs · 2023-09-26T20:05:12Z

+        if isinstance(dataset_one, mfdataplist.MFPandasList) or isinstance(
+            dataset_one, mfdataplist.MFPandasTransientList
+        ):


This could be simplified to isinstance(dataset_one, (mfdataplist.MFPandasList, mfdataplist.MFPandasTransientList))
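
The equivalence behind this suggestion can be checked with stand-in classes. The classes below are empty placeholders that borrow the names from the diff and are not the real flopy classes. `isinstance` accepts a tuple of types and returns true if the object is an instance of any of them.

```python
class MFPandasList:
    """Placeholder for mfdataplist.MFPandasList (illustration only)."""


class MFPandasTransientList:
    """Placeholder for mfdataplist.MFPandasTransientList (illustration only)."""


class MFList:
    """Placeholder for a class that should not match."""


def is_pandas_list_chained(obj):
    # Original form: two isinstance calls joined with "or".
    return isinstance(obj, MFPandasList) or isinstance(
        obj, MFPandasTransientList
    )


def is_pandas_list(obj):
    # Simplified form: one call with a tuple of accepted types.
    return isinstance(obj, (MFPandasList, MFPandasTransientList))


samples = [MFPandasList(), MFPandasTransientList(), MFList(), 42]
results = [is_pandas_list(obj) for obj in samples]
```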

jlarsen-usgs · 2023-09-26T20:05:45Z

+                assert isinstance(
+                    dataset, mfdataplist.MFPandasList
+                ) or isinstance(dataset, mfdataplist.MFPandasTransientList)


isinstance simplification here too

jlarsen-usgs · 2023-09-26T20:08:12Z

        write_headers=True,
        lazy_io=False,
+        use_pandas=True,


What does write_headers write?

write_headers (bool): when true, flopy writes a header to each package file indicating that it was created by flopy.

jlarsen-usgs · 2023-09-26T20:09:46Z

+                elif isinstance(
+                    value, mfdatalist.MFTransientList
+                ) or isinstance(value, mfdataplist.MFPandasTransientList):


The isinstance statement can be simplified: `isinstance(value, (mfdatalist.MFTransientList, mfdataplist.MFPandasTransientList))`

jlarsen-usgs · 2023-09-26T20:10:20Z

+                elif isinstance(value, mfdatalist.MFList) or isinstance(
+                    value, mfdataplist.MFPandasList


Same isinstance comment.

spaulins-usgs requested a review from jlarsen-usgs September 19, 2023 18:37

jlarsen-usgs reviewed Sep 26, 2023

feat(pandas support)

061dcbe

aleaf mentioned this pull request Sep 29, 2023

feature: universal get_dataframe() method #1969

Closed

jlarsen-usgs approved these changes Oct 4, 2023

spaulins-usgs merged commit f82fdf6 into modflowpy:develop Oct 4, 2023

wpbonelli mentioned this pull request Aug 27, 2025

bug: External file write fails when passing cellid as [k,i,j] list in stress period data #2583

Closed
